Predicting Beer Review Scores: A Short Machine Learning Project

Assignment instructions:

Author's note: we will do the EDA before the modelling.

Importing Libraries

Import Data & Initial look at the dataset

Here, we import the dataset and take a first look at the distributions of the review scores. We do not yet add a second variable to the visualizations.
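A first look of this kind might be sketched as follows. The dataset file is not named here, so the example builds a small synthetic DataFrame as a stand-in; the half-point rating scale from 1 to 5 and the sample size are assumptions made for illustration only. Only the column name review_overall comes from this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: the actual file
# and its loading code are not shown here).
rng = np.random.default_rng(0)
scores = np.arange(1.0, 5.5, 0.5)  # assumed half-point scale from 1.0 to 5.0
df = pd.DataFrame({"review_overall": rng.choice(scores, size=1000)})

# Summary statistics of the target variable.
summary = df["review_overall"].describe()

# How often each score occurs, ordered by score rather than by frequency.
counts = df["review_overall"].value_counts().sort_index()

print(summary)
print(counts)
```

With the real data, the same two calls give a quick sense of how skewed the overall scores are before any second variable is brought in.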

EDA on the entire dataset with two selected features

Based on intuition, we hypothesize that taste review scores and aroma review scores will be among the best predictors of the overall review score, because the taste and smell of a beer are its two most immediately apparent features.

The relationship between overall review score and taste review score is as expected and seems strong. The higher the overall review score, the higher the average taste score.

The relationship between overall review score and aroma score is as expected and fairly strong. The higher the overall review score, the higher the average aroma score.

Compared with the taste scores, however, the aroma scores seem slightly less strongly related to the overall review score: the average aroma score spans a range of 4.3 - 1.9 = 2.4 across overall review scores from 1 to 5, whereas the average taste score spans a range of 4.6 - 1.3 = 3.3.
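The comparison of ranges above can be sketched in code as a group-by on the overall score. The data below is synthetic and constructed so that taste tracks the overall score more closely than aroma does; it will not reproduce the numbers quoted above. The Pearson correlation at the end is an added, more direct measure of strength, not something the original analysis is known to have computed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): taste follows overall closely,
# aroma follows it with a flatter slope.
rng = np.random.default_rng(0)
n = 2000
overall = rng.choice(np.arange(1.0, 5.5, 0.5), size=n)
taste = np.clip(overall + rng.normal(0, 0.5, n), 1, 5)
aroma = np.clip(0.6 * overall + 1.2 + rng.normal(0, 0.5, n), 1, 5)
df = pd.DataFrame(
    {"review_overall": overall, "review_taste": taste, "review_aroma": aroma}
)

# Average taste and aroma score at each overall score.
means = df.groupby("review_overall")[["review_taste", "review_aroma"]].mean()

# Range spanned by those averages (largest group mean minus smallest).
spread = means.max() - means.min()

# Pearson correlation of each feature with the overall score.
corr = df.corr()["review_overall"].drop("review_overall")

print(means.round(2))
print(spread.round(2))
print(corr.round(2))
```

A wider spread of group means indicates a steeper relationship, while the correlation also accounts for how much individual reviews scatter around those means.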

Machine Learning Pipeline: Train-Validation-Test Split
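A minimal sketch of a three-way split, assuming scikit-learn's train_test_split is applied twice. The 60/20/20 proportions, the random_state value, and the synthetic arrays are assumptions for illustration; the split actually used in this project is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption) for the two features and the target.
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(1000, 2))  # e.g. review_taste, review_aroma
y = rng.uniform(1, 5, size=1000)       # e.g. review_overall

# Step 1: hold out 20% of the data as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used to compare models and tune them, so that the test set is only touched once, at the very end.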

Baseline Model: Linear Regression

Now let's use review_taste and review_aroma to predict review_overall. We surmised that taste and aroma are the most important factors in predicting overall appreciation of a beer.
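A baseline of this kind could look roughly like the sketch below, assuming scikit-learn's LinearRegression and evaluation on a held-out validation set. The data is synthetic, with coefficients and noise level chosen arbitrarily, so the numbers it prints are not this project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): overall is a noisy linear
# combination of taste and aroma.
rng = np.random.default_rng(0)
n = 2000
review_taste = rng.uniform(1, 5, n)
review_aroma = rng.uniform(1, 5, n)
review_overall = (
    0.6 * review_taste + 0.2 * review_aroma + 0.6 + rng.normal(0, 0.5, n)
)

# Feature matrix with one column per predictor.
X = np.column_stack([review_taste, review_aroma])
X_train, X_val, y_train, y_val = train_test_split(
    X, review_overall, test_size=0.25, random_state=42
)

# Fit ordinary least squares on the training set only.
model = LinearRegression().fit(X_train, y_train)

# Score on data the model has not seen during fitting.
val_r2 = r2_score(y_val, model.predict(X_val))

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("validation R2:", val_r2)
```

The fitted coefficients give a rough sense of how much each feature contributes: a larger coefficient on review_taste than on review_aroma would be consistent with the EDA above.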

The above results are already good: an R2 of 0.63 means that the model explains about 63% of the variance in the overall review scores. We will be able to improve on this later.
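To make the "variance explained" reading concrete, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The five score pairs below are made up for illustration and are unrelated to the beer dataset.

```python
# Made-up true and predicted scores (assumption, for illustration only).
y_true = [4.0, 3.5, 5.0, 2.0, 4.5]
y_pred = [4.0, 3.0, 4.5, 2.5, 4.0]

mean_true = sum(y_true) / len(y_true)

# Total sum of squares: spread of the true scores around their mean.
ss_tot = sum((y - mean_true) ** 2 for y in y_true)

# Residual sum of squares: what the predictions fail to account for.
ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

r2 = 1 - ss_res / ss_tot
print(r2)
```

A model that always predicts the mean has an R2 of 0, and a perfect model has an R2 of 1, so a value of 0.63 sits well above the trivial baseline while leaving room for improvement.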